Evaluation of within-host evolution of methicillin-resistant Staphylococcus aureus (MRSA) by comparing cgMLST and SNP analysis approaches

Whole genome sequencing (WGS) of methicillin-resistant Staphylococcus aureus (MRSA) provides high-resolution typing, facilitating surveillance and outbreak investigations. The aim of this study was to evaluate the genomic variation rate in MRSA, by comparing commonly used core genome multilocus sequencing (cgMLST) against single nucleotide polymorphism (SNP) analyses. WGS was performed on 95 MRSA isolates, collected from 20 carriers during years 2003–2019. To assess variation and methodological-related differences, two different cgMLST schemes were obtained using Ridom SeqSphere+ and the cloud-based 1928 platform. In addition, two SNP methods, 1928 platform and Northern Arizona SNP Pipeline (NASP) were used. The cgMLST using Ridom SeqSphere+ and 1928 showed a median of 5.0 and 2.0 allele variants/year, respectively. In the SNP analysis, performed with two reference genomes COL and Newman, 1928 showed a median of 13 and 24 SNPs (including presumed recombination) and 3.8 respectively 4.0 SNPs (without recombination) per individual/year. Accordantly, NASP showed a median of 5.5 and 5.8 SNPs per individual/year. In conclusion, an estimated genomic variation rate of 2.0–5.8 genetic events per year (without recombination), is suggested as a general guideline to be used at clinical laboratories for surveillance and outbreak investigations independently of analysis approach used.


Results
Molecular typing of MRSA in long-term carriers. All MRSA isolates (n = 95) were successfully whole genome sequenced and subsequently analysed. Four samples, two from patient 5, one from patient 9 and one from patient 12 were excluded due to change of sequence type (ST) and complex type (CT), showing an obvious acquisition of a different MRSA strain during the follow up as shown in Table 1. The remaining samples (n = 9) for these patients were included for detailed analyses.
Twenty-one different spa types were found in the collection. Eight samples were re-analysed with Sanger sequencing, of these samples, five were re-run to verify a change of spa within a patient during the follow up. The remaining three samples were re-run due to failure with WGS to define a spa type but successfully analysed with Sanger sequencing.
Two different CT and spa types were present in patient 10, therefore, this patient's samples were divided in two groups, 10.1 and 10.2 as shown in Table 1.
The relatedness of the isolates included in the analysis is shown in the phylogenetic tree generated from the CT analysis performed with Ridom SeqSphere+ in Fig. 1.
Comparisons of within-host evolution by cgMLST approaches in MRSA carriers. The cgMLST analysis using Ridom SeqSphere+ showed a median of 5.0 (range 0.5-10.8) allele variants/year per individual while the analysis by 1928 displayed a median of 2.0 (range 0.5-6.8) allele variants/year. The summary of the estimated genomic variation rate for each patient and method are shown in Table 2.
Comparison of within-host evolution by SNP approaches. The 1928 SNP analysis, excluding recombination events, showed a median of 4.0 (range 0.75-8.7) and 3.8 (range 0.8-8.8) SNP changes per individual/year whilst the analysis, including recombination events, showed a median of 23.6 (range 8.4-47.0) and 13.0 (range 3.6-22.3) using the COL-and Newman references, respectively. The considered percentage of the genome compared to the reference ranged from 85.4 to 92.4% for the COL reference and 82.6% to 91.3% for the Newman reference. The corresponding results for the NASP pipeline were 5.8 (range 2.2-12.5) and 5.5 (range 1.2-12.3) changes per individual/year (Table 2) were the considered percentage of the genome compared to the reference was 80.53% for the COL reference and 77.98% for the Newman reference.
Comparison of estimated genomic variation rate according to ST-type. Our results showed a genomic variation rate between the eleven different STs found. The dominant clone ST88 showed an estimated genomic variation rate of 4.5 alleles/year, 5.6 SNP/year (NASP) and 4.1 SNP/year (1928 WOR). The ST1 (n = 2) and ST1340 (n = 6) showed the lowest and highest estimated genomic variation rate of 1.5 and 10.8 alleles/year and 2.5 and 12.5 SNP/year (NASP) whereas the rate for 1928 (WOR) was 1.0 and 8.7 SNP/year, respectively. The estimated genomic variation rate for each type was calculated by adding the genomic variation rate for each of the STs divided with the total number of years among the patients within the same ST as shown in Table 3 and graphically in Fig. 2.

Discussion
For correct interpretation of temporal divergence of bacterial lineages of SNPs and allele differences, knowledge about the genomic variation rate of MRSA is critical. Molecular methods such as typing of the bacteria are used to confirm if a new case of MRSA belongs to a known clone or if it is unique. Surveillance of MRSA is essential in order to identify modes and routes and to prevent or limit further transmission. Nowadays many laboratories  14,15 . The continuous development of high-throughput measurement techniques as WGS produces large numbers of data sets therefore access to bioinformatics knowledge is the key in rapidly computing and interpreting the information obtained from the analysis. Hence, more user-friendly bioinformatics platforms need to be developed to allow easier interpretation, management and exploration of data for personnel who do not have bioinformatics expertise.
In the present study, we describe how different ways to analyse WGS data for MRSA using both SNP and cgMLST affects the estimated genomic variation rate of MRSA within carriers. WGS combined with user-friendly software, such as SeqSphere+, cloud-based pipelines like 1928 and bioinformatics tools including NASP, reveals more detailed genetic information than classical typing technologies. Genetic mutations in organisms causes diversity and can be beneficial, harmful, or neutral, depending on their context or location. Previous studies have used mutation rate as a marker for relatedness between isolates during outbreaks and for patient-to-patient transmission and have demonstrated that the estimated mutation rate of MRSA can range from 2-10 SNP/year 10,[14][15][16][17] . In this study the estimated genomic variation rate was found to be within a range of 2.0-5.0 alleles/year with the Table 1. Classification of samples according to its multilocus sequencing type (MLST), core genome MLST (Complex type) and Staphylococcus aureus protein A (spa) per patient and year performed with SeqSphere+. The change of MRSA strain appears clearly (orange) for patients 5, 9 and 12 therefore the samples (n = 4) were excluded from the calculations. Patient 10 has two of five samples with a different CT-and spa-type consequently, patient 10 was divided in two groups, patient 10.1 (dark colour) and patient 10.2 (light colour). Three samples (light yellow) were re-analyse for spa with Sanger sequencing due to failure with WGS to define spa. In addition, five more samples; two samples from patient 2, two samples from patient 6 and one sample from patient 16 were also sequenced with Sanger to verify the results obtained with WGS due to change of spa type within the patient during the follow up.   (Table 1) and consequently four samples were excluded from the overall calculations: two samples from patient 5, one from patient 9 and one from patient 12. The detection of another MRSA strain can be explained with intra-host diversification where the bacterial strain can differ between infection and colonising sites so both lineages can be carried at all-time points or colonisation from a close relative [18][19][20] . Overall, the results showed that the genomic variation rate does not differ much between the methods used if recombination events are not included in the analysis as shown in Table 2. Recombination events, however, are handled differently by cgMLST if compared to a SNP alignment, as recombinant regions with a high density of SNPs can be filtered, while cgMLST methods will collapse these regions into a smaller number of allelic changes 11 . Although the collection of samples for making a more categorized estimation according to the different STs is small, our results are comparable with the results obtained in previous studies and we can see that the genomic variation rate seems to vary within the different STs 10,14-17 .
The majority of samples belonged to the group of CA-MRSA, which are known for being more virulent and transmissible than HA-MRSA causing infection even in healthy individuals 21 . CA-MRSA was defined through identification of MRSA and fulfilled the following criteria; outpatient care, positive within 48 h of admission to a hospital, no medical history of MRSA infection or colonisation, among others 22 . The CA-MRSA clone ST88 was found to be the dominant clone in our collection (n = 29). ST88 has been found in many countries including Sweden and is the predominant clone in Sub-Saharan Africa 23 . In our collection, ST88 showed an estimated genomic variation rate that fits into the estimated genomic variation rate for the whole collection, which is not that surprising due to it is a quite large proportion of the collection. The estimated genomic variation rate however, varied between the other STs. In the present study, spa typing was performed with SeqSphere+ using WGS data, however, the use of WGS did not succeed recognising one of the spa types. Therefore, spa typing with Sanger sequencing was performed for the isolates where the spa type had not succeeded and additionally in the cases where only the spa type differed (same ST and CT) in isolates from the same patient. Sanger sequencing www.nature.com/scientificreports/ verified this change of spa types to be correct and the isolate that failed was found to be t458. The spa type t458 surprisingly contained only one repeat, which indicates fail in the data analysis and not the WGS, since short read WGS is known for having difficulties to analyse spa types with many similar repeats 24 . In our laboratory, we use WGS for routine typing and surveillance of MRSA. Here we also included the spa type due to many laboratories still use spa typing as standard method for typing of MRSA even though it lacks of discrimination power in outbreak investigations [25][26][27] . The selection of a reference genome with a well-defined core genome can be a source of bias if it is not closely related to the analysed sequences and can affect the subsequent analyses such as the detection of genetic events 28 . This could explain why the cgMLST analysis performed with 1928 showed a lower estimated genomic variation  www.nature.com/scientificreports/ rate compared to SeqSphere+ since they are based on two different references, i.e. Newman and COL. The core genome from the Newman's reference is only based on 1704 genes compared to the COL reference genome that is 1861 genes resulting in a difference of 157 genes, which could support the variation of random genetic variations. In contrast to the cgMLST results based on different core genomes, the SNP-based analyses performed with NASP and 1928 actually showed comparable results, independently of the reference genome used. Otherwise, it is usually more difficult to compare results from SNPs analyses when using different references since anything that does not match the reference will be excluded from the results 11 . Therefore, the choice of reference genome in a SNP analysis is known to be of high importance and needs to be very closely related to the isolates if high resolution is the goal, as in outbreak investigations. A few limitations need to be considered in our study. The genomic variation rate per ST was calculated using a small collection of samples. However, our results indicate a variation of genomic variation rate between different STs, therefore, it is important to investigate a larger sample collection since different STs dominate in different parts of the world. In addition, in this study, the mutation rate was studied within-host and not between carriers, which often is the case during an outbreak investigation. Another limitation is that in the clinical laboratory, only one random colony is chosen to analyse further and all the molecular results are thus based on this colony. This might have led to a selection of the more dominant strain during a certain period of time although several strains might have been present in the patient's flora. Therefore, this could have affected the presence of "new" strains in some patients. Another possibility is that the patient did acquire a new strain through infection from another person or that the carrier possesses a mixed MRSA-population that is (occasionally) more predominant [18][19][20] . One important aspect to consider about NASP is that it does not actively consider recombinant sites. To take recombination events into account the data requires further analyses, which was not done since few SNPs between person-isolates were observed.
In conclusion, the result of this study suggest that an estimated genomic variation rate of 2.0-5.8 (allele or SNP) per year can be used as a guideline in clinical laboratories for surveillance and outbreak investigations, regardless of the analysis method used i.e. cgMLST or SNP if excluding recombination events. However, it should be taken into account that the estimated genomic variation rate may vary in different MLST sequence types but also the choice of reference genome and typing method used due to the overall variability and suitability they possess.

Materials and methods
Ethical considerations. In the present study only subcultured bacterial isolates that were previously collected and retrieved, as part of the Swedish monitoring program, was stored. No tissue material or other biological material from patients was stored. No patient information was collected. Thus, the requirement for patient consent was waived and therefore ethical approval was not required.
Sample selection. Ninety-five MRSA isolates originating from twenty long-term carriers were collected from 2003 to 2019 by clinical routine and as part of the national monitoring program. One isolate per patient and per annual sampling (range 277 -388 days (supplementary Table S1)) had been stored at − 80 °C in preservation medium (trypticase soy broth, BD, Diagnostic Systems, Sparks, MD, USA), supplemented with 0.3% (w/v) yeast extract (BD Diagnostic Systems) and 29% horse serum (SVA, Uppsala, Sweden) in clinical routine at the Department of Laboratory Medicine, Clinical Microbiology, at Örebro University Hospital, Örebro, Sweden. The number of samples (n = 95), obtained from the 20 long-term MRSA carriers, ranged from two to 9 samples per patient (Fig. 3) and the sample collection time frame ranged from two to 11 years (supplementary Table S2). A detailed flowchart for the entire analyse chain is shown in Fig. 4.   Table S2). A minimum coverage of more than 50-fold was considered acceptable quality for the SNP analysis. Two publicly available S. aureus reference genomes, COL (NC_002951.2) 29, 30 and Newman (NC_009641) 31 , were used for alignment of the samples in each method. The chosen references in this study are based on the default settings for each cgMLST approach, which are used in the clinical routine for NGS analysis at the microbiology laboratory at Örebro University hospital.

Culture conditions and
cgMLST analysis and molecular typing by Ridom SeqSphere+. The reads were de novo assembled using Velvet (General public licence version 2.0) integrated in Ridom SeqSphere+ 32 using default settings: reads were trimmed until the average base quality of 30 was reached in a window of 20 bases. The samples were aligned to the COL reference sequence analysed by cgMLST using the software Ridom SeqSphere+ scheme based on 1861 genes 33,34 . More than 50-fold average coverage with an average read length of > 200 bp, and percentage good targets for cgMLST ≥ 97% was considered acceptable sequence quality for further data analysis. A phylogenetic tree was constructed in SeqSphere+ using their neighbour-joining tree algorithm (Fig. 1) and distance matrix calculation 35 was performed; missing values were pairwise ignored. Additional information was extracted regarding the ST, CC, spa and CT. The CT is an additional identifier that clusters together samples with very similar cgMLST profiles and every combination of genomic allele profiles is given a unique identifier. The threshold used to define the CT in the cgMLST analysis was up to 24 allele differences (https:// www. cgmlst. org/ ncs/ schema/ 141106/) 10 . cgMLST analysis by 1928 platform. For the 1928 platform, core gene data for all sequenced bacteria were extracted by a k-mer based allele calling method. The core genome scheme is based on 1704 genes built on the Newman reference genome 31 . More than 30-fold average coverage over the whole genome in combination with ≥ 95% of good cgMLST targets are considered acceptable sequence quality. The 1928 platform's cgMLST method uses a custom developed allele-calling algorithm based on an alignment-free k-mer approach. Phylogenetic trees were constructed using 1928 UPGMA-tree algorithm and distance matrix calculations were made by calculating all pairwise allelic differences between samples, missing genes were pairwise ignored 36